# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
loan_data = pd.read_csv('prosperLoanData.csv')
loan_data.head()
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
loan_data.shape
(113937, 81)
loan_data.dtypes
ListingKey object
ListingNumber int64
ListingCreationDate object
CreditGrade object
Term int64
...
PercentFunded float64
Recommendations int64
InvestmentFromFriendsCount int64
InvestmentFromFriendsAmount float64
Investors int64
Length: 81, dtype: object
loan_data.describe()
| ListingNumber | Term | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113937.000000 | 113912.000000 | 113937.000000 | 113937.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | ... | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 |
| mean | 6.278857e+05 | 40.830248 | 0.218828 | 0.192764 | 0.182701 | 0.168661 | 0.080306 | 0.096068 | 4.072243 | 5.950067 | ... | -54.725641 | -14.242698 | 700.446342 | 681.420499 | 25.142686 | 0.998584 | 0.048027 | 0.023460 | 16.550751 | 80.475228 |
| std | 3.280762e+05 | 10.436212 | 0.080364 | 0.074818 | 0.074516 | 0.068467 | 0.046764 | 0.030403 | 1.673227 | 2.376501 | ... | 60.675425 | 109.232758 | 2388.513831 | 2357.167068 | 275.657937 | 0.017919 | 0.332353 | 0.232412 | 294.545422 | 103.239020 |
| min | 4.000000e+00 | 12.000000 | 0.006530 | 0.000000 | -0.010000 | -0.182700 | 0.004900 | -0.182700 | 1.000000 | 1.000000 | ... | -664.870000 | -9274.750000 | -94.200000 | -954.550000 | 0.000000 | 0.700000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 4.009190e+05 | 36.000000 | 0.156290 | 0.134000 | 0.124200 | 0.115670 | 0.042400 | 0.074080 | 3.000000 | 4.000000 | ... | -73.180000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 6.005540e+05 | 36.000000 | 0.209760 | 0.184000 | 0.173000 | 0.161500 | 0.072400 | 0.091700 | 4.000000 | 6.000000 | ... | -34.440000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 44.000000 |
| 75% | 8.926340e+05 | 36.000000 | 0.283810 | 0.250000 | 0.240000 | 0.224300 | 0.112000 | 0.116600 | 5.000000 | 8.000000 | ... | -13.920000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 115.000000 |
| max | 1.255725e+06 | 60.000000 | 0.512290 | 0.497500 | 0.492500 | 0.319900 | 0.366000 | 0.283700 | 7.000000 | 11.000000 | ... | 32.060000 | 0.000000 | 25000.000000 | 25000.000000 | 21117.900000 | 1.012500 | 39.000000 | 33.000000 | 25000.000000 | 1189.000000 |
8 rows × 61 columns
loan_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
loan_data['ListingCategory (numeric)'].unique()
array([ 0, 2, 16, 1, 7, 13, 6, 15, 20, 19, 3, 18, 8, 4, 11, 14, 5,
9, 17, 10, 12])
loan_data.LoanStatus.unique()
array(['Completed', 'Current', 'Past Due (1-15 days)', 'Defaulted',
'Chargedoff', 'Past Due (16-30 days)', 'Cancelled',
'Past Due (61-90 days)', 'Past Due (31-60 days)',
'Past Due (91-120 days)', 'FinalPaymentInProgress',
'Past Due (>120 days)'], dtype=object)
loan_data.query('LoanStatus=="Completed"')
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 11 | 0F1734025150298088A5F2B | 213551 | 2007-10-09 20:28:33.640000000 | C | 36 | Completed | 2008-01-07 00:00:00 | 0.15033 | 0.1325 | 0.1225 | ... | -0.88 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 53 |
| 21 | 1017340630713945364DA6B | 241498 | 2007-11-30 20:33:49.227000000 | D | 36 | Completed | 2010-12-12 00:00:00 | 0.21488 | 0.2075 | 0.1975 | ... | -51.12 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 53 |
| 26 | 0F4F3571508214190FD75CB | 713131 | 2013-02-12 18:07:31.690000000 | NaN | 60 | Completed | 2013-10-22 00:00:00 | 0.30748 | 0.2809 | 0.2709 | ... | -25.81 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113923 | E334343080922352965DF86 | 395877 | 2008-09-10 08:26:30.537000000 | B | 36 | Completed | 2011-09-23 00:00:00 | 0.22378 | 0.2089 | 0.1989 | ... | -135.50 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 270 |
| 113927 | E3433419834735803891976 | 322787 | 2008-04-30 21:25:19.670000000 | AA | 36 | Completed | 2011-05-09 00:00:00 | 0.07469 | 0.0679 | 0.0579 | ... | -68.47 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2 | 0 | 0.0 | 194 |
| 113928 | E34935176664905343E01EA | 510097 | 2011-06-06 19:02:44.443000000 | NaN | 36 | Completed | 2011-09-19 00:00:00 | 0.22362 | 0.1899 | 0.1799 | ... | -4.41 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 25 |
| 113929 | E3553583161337791FCB87F | 830413 | 2013-07-06 17:40:01.657000000 | NaN | 36 | Completed | 2014-02-07 00:00:00 | 0.30285 | 0.2639 | 0.2539 | ... | -6.54 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 26 |
| 113935 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | ... | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
38074 rows × 81 columns
loan_data.Recommendations.unique()
array([ 0, 2, 1, 4, 3, 9, 5, 16, 39, 21, 7, 14, 8, 6, 24, 19, 18])
# renaming some columns for consistensy
loan_data=loan_data.rename(columns = {'ListingCategory (numeric)':'ListingCategory', 'ProsperRating (numeric)':'ProsperRatingNumeric', 'ProsperRating (Alpha)':'ProsperRatingAlpha', 'TradesNeverDelinquent (percentage)':'TradesNeverDelinquentPercentage'})
# convert some colum features into ordered categorical types
ordinal_var_dict = {'LoanStatus': ['Current','Completed','Chargedoff','Defaulted',
'Past Due (1-15 days)', 'Past Due (16-30 days)', 'Past Due (31-60 days)', 'Past Due (61-90 days)',
'Past Due (91-120 days)', 'Past Due (>120 days)', 'FinalPaymentInProgress'],
'CreditGrade':['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
'ProsperRatingAlpha': ['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
'ProsperRatingNumeric': [1,2,3,4,5,6,7],
'IncomeRange': ['Not displayed', 'Not employed', '$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+']}
for var in ordinal_var_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_var_dict[var])
loan_data[var] = loan_data[var].astype(ordered_var)
loan_data.ListingCategory.unique()
array([ 0, 2, 16, 1, 7, 13, 6, 15, 20, 19, 3, 18, 8, 4, 11, 14, 5,
9, 17, 10, 12])
cancelled_loans = loan_data.query('LoanStatus=="Cancelled"')
cancelled_loans.IsBorrowerHomeowner.head(10).T
Series([], Name: IsBorrowerHomeowner, dtype: bool)
completed_loans = loan_data.query('LoanStatus=="Completed"')
completed_loans.IsBorrowerHomeowner.value_counts()
False 19794 True 18280 Name: IsBorrowerHomeowner, dtype: int64
x= completed_loans.IsBorrowerHomeowner.unique()
y= completed_loans.IsBorrowerHomeowner.value_counts()
mycolor = sb.color_palette()[0]
myorder = completed_loans.IsBorrowerHomeowner.value_counts().index
sb.countplot(data=completed_loans, x='IsBorrowerHomeowner', color=mycolor, order=myorder)
<AxesSubplot:xlabel='IsBorrowerHomeowner', ylabel='count'>
defaulted_loans = loan_data.query('LoanStatus=="Defaulted"')
defaulted_loans.IsBorrowerHomeowner.value_counts()
False 2744 True 2274 Name: IsBorrowerHomeowner, dtype: int64
sb.countplot(data=defaulted_loans, x='IsBorrowerHomeowner', color=mycolor, order=myorder)
<AxesSubplot:xlabel='IsBorrowerHomeowner', ylabel='count'>
Chargedoff_data = loan_data.query('LoanStatus=="Chargedoff"')
sb.countplot(data=Chargedoff_data, x='IsBorrowerHomeowner', color=mycolor, order=myorder)
<AxesSubplot:xlabel='IsBorrowerHomeowner', ylabel='count'>
sum(loan_data.duplicated())
0
loan_data.ListingCreationDate.value_counts()
2013-10-02 17:20:16.550000000 6
2013-09-08 09:27:44.853000000 4
2013-08-28 20:31:41.107000000 4
2013-12-06 11:44:58.283000000 4
2013-12-06 05:43:13.830000000 4
..
2012-12-31 12:11:39.570000000 1
2012-12-31 13:41:24.307000000 1
2014-01-11 20:48:11.570000000 1
2013-08-28 16:01:57.693000000 1
2010-02-22 22:23:00.697000000 1
Name: ListingCreationDate, Length: 113064, dtype: int64
Your answer here!
- The dataset contains 113937 rows of data with 81 columns. (ListingKey,ListingNumber,ListingCreationDate,CreditGrade,Term,LoanStatus,ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors)
The BorrowerRate or interest rate of the loans.
My focus of investigation is to determine what features are best for predicting Loan BorrowerRate in the dataset?
fig = plt.figure(figsize=[8,10])
color_base = sb.color_palette()[0]
sb.countplot(data= loan_data, y= 'BorrowerState',color=color_base);
We can see from the plot that California has the highest number of prosper loan borrowers followed by three other States. I will have to sort the BorrowerState programmatically to show a clear order of the states in order.
The pandas series function value counts is valuable here. This function counts the frequency of each unique value in the series and then sort them in descending order. In order to get the bar order, we want the index values and then, store them in a variable. We pass this to the order parameter in order to get the sorted bar chart as shown below:
#Ordering the borrowerstate to get a clear picture of the distribution
myorder = loan_data.BorrowerState.value_counts().index
fig = plt.figure(figsize=[15,10])
color_base = sb.color_palette()[0]
sb.countplot(data= loan_data, x= 'BorrowerState',color=color_base,order=myorder);
With this we can see the distribution clearly, California having the highest count and then Texas and then Newyork and Florida having an even count. The three later states have almost half the count of California.
As a side note it is good to note that. Prosper Marketplace, Inc. is a San Francisco, California-based company in the peer-to-peer lending industry.
loan_data.CreditGrade.unique()
['C', NaN, 'HR', 'AA', 'D', 'B', 'E', 'A', 'NC'] Categories (8, object): ['NC' < 'HR' < 'E' < 'D' < 'C' < 'B' < 'A' < 'AA']
fig = plt.figure(figsize=[10,8])
color_base = sb.color_palette()[0]
sb.countplot(data= loan_data, x= 'CreditGrade',color=color_base);
From the data dictionary we can see that the CreditGrade is ordinal data of rating assigned to the loan listing when the listing went live. The order being as follows in ascending order:
0 - NC, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA.
This means AA is the highest rating to be given to a listing. Let's list the listing in a descending order from the highest to the lowest.
myorder = ['AA','A','B','C','D','E','HR','NC']
fig = plt.figure(figsize=[10,8])
color_base = sb.color_palette()[0]
sb.countplot(data= loan_data, x= 'CreditGrade',color=color_base, order=myorder);
df_original = loan_data.shape[0]
df_original
113937
loan_data.CreditGrade.value_counts()
C 5649 D 5153 B 4389 AA 3509 HR 3508 A 3315 E 3289 NC 141 Name: CreditGrade, dtype: int64
The distribution is Unimodal in nature, and We can see CreditGrade 'C' has the highest count followed by 'D' and then 'B'.
Further investigation will need to be made on the relationship between this variable and BorrowerRate in bivariate investigation. To see what CreditGrade has the highest BorrowerRates.
#loan_data.ListingCreationDate.dt.year
y = pd.DatetimeIndex(loan_data['ListingCreationDate']).year.value_counts()
plt.bar(y.index,y)
plt.xticks(y.index);
From the data dictionary we can note that ListingCategory is a categorical variable. The category of the listing that the borrower selected when posting their listing:
Lets investigate this variable further to see which category has the highest count.
fig = plt.figure(figsize=[10,8])
color_base = sb.color_palette()[0]
sb.countplot(data=loan_data, x='ListingCategory', color=color_base);
xlocs = list(loan_data.ListingCategory.value_counts().index)
print(xlocs)
[1, 0, 7, 2, 3, 6, 4, 13, 15, 18, 14, 20, 19, 5, 16, 11, 8, 10, 9, 12, 17]
loan_data.ListingCategory.value_counts()
1 58308 0 16965 7 10494 2 7433 3 7189 6 2572 4 2395 13 1996 15 1522 18 885 14 876 20 771 19 768 5 756 16 304 11 217 8 199 10 91 9 85 12 59 17 52 Name: ListingCategory, dtype: int64
Let's try displaying this in a more descriptive way. Using the original text to get a more clear picture.
myorder = loan_data.ListingCategory.value_counts().index
fig = plt.figure(figsize=[10,10])
color_base = sb.color_palette()[0]
sb.countplot(data=loan_data, y='ListingCategory', color=color_base);
xlabel_locs =['Debt Consolidation', 'Not Available','Other','Home Improvement','Business','Auto','Personal Loan','Household Expenses',
'Medical/Dental','Taxes','Large Purchases','Wedding Loans','Vacation','Student Use','Motorcycle','Engagement Ring',
'Baby&Adoption','Cosmetic Procedure','Boat','Green Loans','RV']
plt.yticks(xlocs,xlabel_locs);
fig = plt.figure(figsize=[8,8])
sb.countplot(data=loan_data, y='ListingCategory', color=color_base,order=myorder);
The debt consolidation category is ranked highest followed by not-available. It would be worth investigating the LP_InterestandFees and CreditGrades variables associated with this category in multivariate exploration.
fig = plt.figure(figsize=[8,5])
sb.countplot(data=loan_data, x='LoanStatus',color=color_base);
plt.xticks(rotation=90);
sb.countplot(data=loan_data, y='LoanStatus',color=color_base);
In order to know what proportion of the LoanStatus falls into each category I will use relative frequency. One way of doing this is to use Matplotlib to change the tick marks from leveling absolute counts, to relative counts. To do this, I'll create a list of tick positions which I'll do by computing the length of the longest bar in terms of proportion. However, the LoanStatus column contains only categorical nominal variables and as a result has to be represented in a numerical form in order to compute the length of the longest bar for the numerator while the original df data frame gives us the denominator, we use value counts to get the numerator. To achieve this I’ll be using Pandas get_dummies function
LoanStatus =pd.get_dummies(loan_data['LoanStatus']).groupby(level =0).sum()
max_loanstatus = LoanStatus['Current'].value_counts()[1]
max_loanstatus
56576
df_original =loan_data.shape[0]
df_original
113937
max_propotion = max_loanstatus/df_original
print(max_propotion)
0.4965551137909546
Since I would like to see the whole distribution it is therefore important to know what the value of the least proportion is in order to accommodate it, if possible, in our numpy arange function.
min_loanstatus = LoanStatus['Past Due (>120 days)'].value_counts()[1]
min_propotion = min_loanstatus/df_original
print(min_propotion)
0.00014042848240694418
Since the min_propotion is between 0 and 0.1. I will not be including it in the xticks of the figure, it will be part of my np.arange function, however.
tick_props = np.arange(0, max_propotion, 0.1)
tick_names =['{:0.2f}'.format(v) for v in tick_props]
tick_names
['0.00', '0.10', '0.20', '0.30', '0.40']
loanstatus_order =loan_data['LoanStatus'].value_counts().index
color_base = sb.color_palette()[0]
sb.countplot(data= loan_data, y= 'LoanStatus', color=color_base, order =loanstatus_order);
plt.xticks(tick_props*df_original, tick_names);
plt.xlabel('proportion');
The resulting plot shows us that more than 40% of all loanstatus is current while about 35% of loanstatus is
completed. There are also about 10% defaulted loanstatus. The rest taking up the 15% left.
fig=plt.figure(figsize=[8,5]);
plt.hist(data=loan_data, x='Investors', bins=20);
plt.xlabel('Investors')
plt.ylabel('Frequency');
The distribution of Investors values appears right skewed, with relatively few points above 400 in value but there isn't a lot of detail beyond the 400 value.
By default, matplotlib will split the data into 10 bins, which is usually too few but in this case too small because we would need to zoom in. We can use the matplotlib function xlim to set the upper and lower bounds of the bins that will be displayed in the plot. This function takes an argument, a list, or tuple of two values specifying these limits. Here, I'll make use of Numpy's arange function to create bins of size thirty. The first argument is the minimum value, while the second argument, will be the maximum value. The third argument gives the step-size for the bins.
I will add plus thirty to the second argument. This is because the values generated by arange will not include the maximum value. By adding plus thirty, this means that all values should be represented in the chart. When I add the bins list to the Hist Function, this creates a view of the data that implies the data is right skewed, save for the presence of possible outliers beyond the 600 value range.
loan_data.Investors.max()
1189
fig = plt.figure(figsize=[8,6])
bins = np.arange(0, loan_data['Investors'].max()+30, 30)
plt.hist(data =loan_data, x='Investors', bins=bins)
plt.xlim((0,600));
plt.xlabel('Number of Investors')
plt.ylabel('count')
Text(0, 0.5, 'count')
sb.distplot(loan_data['Investors'])
plt.ylabel('count');
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
fig=plt.figure(figsize=[8,5])
sb.distplot(loan_data['Investors'],kde=False);
plt.ylabel('count');
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
As can be seen from the plots above, the right skewed histograms indicate most loans have a small number of investors. We will check how the number of investors affect the BorrowerRate of the loans.
plt.figure(figsize=[10,8])
bins = np.arange(0, loan_data['MonthlyLoanPayment'].max()+30, 30)
plt.hist(data =loan_data, x='MonthlyLoanPayment', bins=bins)
plt.xlim((0,1000));
plt.xlabel('MonthlyLoanPayment ($)')
plt.ylabel('count')
Text(0, 0.5, 'count')
The MonthlyLoanPayment appear partly right skewed with a lot of MonthlyLoanPayment in the lower end. This might be due to its correlation with the Loan Amount. We will observe this at the Bivariate Exploration.
loan_data.BorrowerAPR.max()
0.5122899999999999
bins = np.arange(0, loan_data['BorrowerAPR'].max()+0.025, 0.025)
plt.figure(figsize=[8, 5])
plt.hist(data = loan_data, x = 'BorrowerAPR', bins = bins)
plt.xlabel('BorrowerAPR (%)')
plt.show()
BorrowerAPR is the Borrower's Annual Percentage Rate (APR) for a loan, this is the annual interest rate for a loan. The plot above shows a slightly right skewed graph indicating most loans have less than 0.3% of Borrower APR.
bins = np.arange(0, loan_data['BorrowerRate'].max()+0.020, 0.020)
plt.figure(figsize=[8, 5])
plt.hist(data = loan_data, x = 'BorrowerRate', bins = bins)
plt.xlabel('BorrowerRate (%)')
plt.show()
The Borrower Rate is the interest rate of a loan. Same case as the Borrower APR can be observed. Most of the borrower interest rates can be observed on the lower percentages of the Borrower Rate metric.
loan_data.ProsperRatingNumeric.value_counts()
4 18345 5 15581 6 14551 3 14274 2 9795 1 6935 7 5372 Name: ProsperRatingNumeric, dtype: int64
color_base = sb.color_palette()[0]
sb.countplot(data= loan_data, x= 'ProsperRatingNumeric', color=color_base);
xind=loan_data.ProsperRatingNumeric.value_counts().index
color_base = sb.color_palette()[0]
sb.countplot(data= loan_data, x= 'ProsperRatingNumeric', color=color_base, order=xind);
As indicated on the dictionary the ProsperRating Numeric is credit rating done by prosper on the borrowers. The barchart indicates more of the borrowers have a rating of 4 and less have 7. We can relate this rating with BorrowerRate and see how it affects the distribution of BorrowerRate.
loan_data.ProsperScore.value_counts()
4.0 12595 6.0 12278 8.0 12053 7.0 10597 5.0 9813 3.0 7642 9.0 6911 2.0 5766 10.0 4750 11.0 1456 1.0 992 Name: ProsperScore, dtype: int64
color_base = sb.color_palette()[0]
plt.figure(figsize=[10,5])
sb.countplot(data= loan_data, x= 'ProsperScore', color=color_base);
color_base = sb.color_palette()[0]
plt.figure(figsize=[10,5])
order= loan_data.ProsperScore.value_counts().index
sb.countplot(data= loan_data, x= 'ProsperScore', color=color_base, order=order);
According to the data dictionary the The score ranges from 1-11, with 11 being the best, or lowest risk score. Applicable for loans originated after July 2009.
It looks like for the biggest bulk of loans has a risk score of 4 down to risk score 1 . We will investigate further on how this affects the borrower rate of these loans in bivariate exploration.
loan_data['LoanOriginalAmount'].unique()
array([ 9425, 10000, 3001, ..., 17669, 19432, 4292])
# start with a standard-scaled plot
fig = plt.figure(figsize=[8,8])
bins = np.arange(0, loan_data['LoanOriginalAmount'].max()+500, 500)
plt.figure(figsize=[8, 5])
plt.hist(data = loan_data, x = 'LoanOriginalAmount', bins = bins)
plt.xlabel('LoanOriginalAmount')
plt.show();
<Figure size 576x576 with 0 Axes>
The initial plot of the LoanOriginalAmount shows some immediate points of attention. On the LoanOriginalAmount you would notice sudden spikes and some other very high figures in the far right. It's worth taking a bit of time to identify these outliers and see if they need to be filtered out of the data.
dfselect = loan_data[['ListingNumber','LenderYield', 'BorrowerRate', 'BorrowerAPR', 'LoanOriginalAmount', 'EstimatedLoss']]
dfselectdescribe = dfselect.describe([.25,.75,.9,.95,.99])
dfselectdescribe
| ListingNumber | LenderYield | BorrowerRate | BorrowerAPR | LoanOriginalAmount | EstimatedLoss | |
|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113937.000000 | 113937.000000 | 113912.000000 | 113937.00000 | 84853.000000 |
| mean | 6.278857e+05 | 0.182701 | 0.192764 | 0.218828 | 8337.01385 | 0.080306 |
| std | 3.280762e+05 | 0.074516 | 0.074818 | 0.080364 | 6245.80058 | 0.046764 |
| min | 4.000000e+00 | -0.010000 | 0.000000 | 0.006530 | 1000.00000 | 0.004900 |
| 25% | 4.009190e+05 | 0.124200 | 0.134000 | 0.156290 | 4000.00000 | 0.042400 |
| 50% | 6.005540e+05 | 0.173000 | 0.184000 | 0.209760 | 6500.00000 | 0.072400 |
| 75% | 8.926340e+05 | 0.240000 | 0.250000 | 0.283810 | 12000.00000 | 0.112000 |
| 90% | 1.090058e+06 | 0.299900 | 0.309900 | 0.345770 | 15000.00000 | 0.147500 |
| 95% | 1.156791e+06 | 0.307700 | 0.317700 | 0.357970 | 20000.00000 | 0.165000 |
| 99% | 1.221572e+06 | 0.340000 | 0.350000 | 0.374530 | 25000.00000 | 0.183000 |
| max | 1.255725e+06 | 0.492500 | 0.497500 | 0.512290 | 35000.00000 | 0.366000 |
percentile = dfselectdescribe.loc['99%']
percentile
ListingNumber 1.221572e+06 LenderYield 3.400000e-01 BorrowerRate 3.500000e-01 BorrowerAPR 3.745300e-01 LoanOriginalAmount 2.500000e+04 EstimatedLoss 1.830000e-01 Name: 99%, dtype: float64
dfselectLim = dfselect[dfselect<=percentile]
dfselectLim.max()
ListingNumber 1.221566e+06 LenderYield 3.400000e-01 BorrowerRate 3.500000e-01 BorrowerAPR 3.745300e-01 LoanOriginalAmount 2.500000e+04 EstimatedLoss 1.830000e-01 dtype: float64
dfselect.hist(bins =30, edgecolor ="white",layout =(1,7), figsize =[20, 5]);
plt.suptitle("Histograms\n (All Data)", y=1.05, weight ="bold")
plt.tight_layout()
plt.show()
dfselectLim.hist(bins =30, edgecolor ="white",layout =(1,7), figsize =[20, 5]);
plt.suptitle("Histograms\n (less than 99th pecentile)", y=1.05, weight ="bold")
plt.tight_layout()
plt.show()
dfselectLim.LoanOriginalAmount.hist(bins =30, edgecolor ="white", figsize =[20, 10]);
$10000 point, and few on the high end. When plotted on a log-scale, the LoanOriginalAmount distribution looks roughly bimodal, with one peak between $2000 and $5000, and a second peak a little below $10000. Interestingly, there's a steep jump in frequency right before the $10000, rather than a smooth ramp up!loan_data.DebtToIncomeRatio.value_counts()
0.18000 4132
0.22000 3687
0.17000 3616
0.14000 3553
0.20000 3481
...
0.13774 1
0.04983 1
0.15248 1
0.19893 1
0.06056 1
Name: DebtToIncomeRatio, Length: 1207, dtype: int64
abovedebt = loan_data.DebtToIncomeRatio.describe([.25,.75,.9,.95,.99])
percentile = abovedebt.loc['99%']
percentile
dfLim = loan_data[loan_data.DebtToIncomeRatio<=percentile]
dfLim
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | NaN | 36 | Current | NaN | 0.22354 | 0.1864 | 0.1764 | ... | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 113933 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | NaN | 36 | FinalPaymentInProgress | NaN | 0.13220 | 0.1110 | 0.1010 | ... | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 113934 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | NaN | 60 | Current | NaN | 0.23984 | 0.2150 | 0.2050 | ... | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 113935 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | ... | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
| 113936 | E6ED3600409833199F711B7 | 1140093 | 2014-01-15 09:27:37.657000000 | NaN | 36 | Current | NaN | 0.13189 | 0.1039 | 0.0939 | ... | -1.70 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
104336 rows × 81 columns
bins = np.arange(0,dfLim.DebtToIncomeRatio.max()+0.02,0.02)
plt.figure(figsize=[10,8])
plt.hist(data=dfLim, x='DebtToIncomeRatio',bins=bins);
The DebtToIncomeRatio is right skewed which means most people have more income than debt with 99% being below 0.86. More in depth look into the relation between the ratio and the Loan Original Amount and the borrower rate assigned.
loan_data.CreditScoreRangeUpper.value_counts()
699.0 16492 679.0 16366 719.0 15471 739.0 12923 659.0 12199 759.0 9267 779.0 6606 799.0 4624 639.0 4172 619.0 3602 819.0 2644 539.0 1593 559.0 1474 839.0 1409 579.0 1357 599.0 1125 859.0 567 519.0 554 499.0 346 879.0 212 479.0 141 19.0 133 459.0 36 899.0 27 439.0 5 379.0 1 Name: CreditScoreRangeUpper, dtype: int64
loan_data.CreditScoreRangeLower.value_counts()
680.0 16492 660.0 16366 700.0 15471 720.0 12923 640.0 12199 740.0 9267 760.0 6606 780.0 4624 620.0 4172 600.0 3602 800.0 2644 520.0 1593 540.0 1474 820.0 1409 560.0 1357 580.0 1125 840.0 567 500.0 554 480.0 346 860.0 212 460.0 141 0.0 133 440.0 36 880.0 27 420.0 5 360.0 1 Name: CreditScoreRangeLower, dtype: int64
plt.figure(figsize=[10,8])
plt.hist(data=loan_data, x='CreditScoreRangeUpper');
plt.figure(figsize=[10,8])
plt.hist(data=loan_data, x='CreditScoreRangeLower');
I converted categorical columns
into ordered categorical type, for better analysis.
There was a need to rename columns:
This was to ensure conformity of the data.
numeric_vars = ['LenderYield', 'ProsperRatingNumeric','Investors', 'BorrowerRate', 'LoanOriginalAmount', 'BorrowerAPR', 'Term', 'EstimatedLoss','ProsperRatingNumeric','ProsperScore','ProsperRatingAlpha','ListingCategory','CreditScoreRangeLower', 'CreditScoreRangeUpper','CurrentDelinquencies','AmountDelinquent','DebtToIncomeRatio','IncomeRange','StatedMonthlyIncome','InvestmentFromFriendsAmount']
# correlation plot
plt.figure(figsize = [15, 8])
sb.heatmap(loan_data[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.show()
With respect to our feature of interest, namely, predicting interest rate ( BorrowerRate), we have from the correlation heat map a very strong positive correlation with LenderYield, BorrowerAPR and EstimatedLoss, while we have a moderate negative correlation with CreditScoreRangeLower and CreditScoreRangeUpper.
The other features examined showed very weak correlation with BorrowerRate and as a result we will first concentrate on the aforementioned variables.
numeric_vars_new = ['LenderYield','BorrowerAPR','BorrowerRate', 'EstimatedLoss']
#, 'CreditScoreRangeLower', 'CreditScoreRangeUpper'
categoric_vars = ['CreditGrade', 'ProsperRatingAlpha','ProsperRatingNumeric','ProsperScore', 'EmploymentStatus']
# plot matrix: sample 200 numeric features so that plots are clearer and
# they render faster
samples = np.random.choice(loan_data.shape[0], 500, replace = False)
df_samp = loan_data.loc[samples,:]
g = sb.PairGrid(data = df_samp, vars = numeric_vars_new)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter)
<seaborn.axisgrid.PairGrid at 0x7f7cd16d5c50>
According to the data dictionary the BorrowerRate is the interest rate for a loan while the BorrowerAPR is the Annual Percentage Rate or a calculated total annual interest rate of the loan.
LenderYield is equal to the interest rate on the loan less the servicing fee.
The Estimated Loss is the estimated principal loss on charge-offs.
The three definitions indicates a strong correlation between the four variables. Let us visualize them to confirm this.
plt.figure(figsize=[25,10])
plt.subplot(1,3,1)
sb.regplot(data=loan_data, x= 'BorrowerRate',y='BorrowerAPR')
plt.subplot(1,3,2)
sb.regplot(data=loan_data, x= 'BorrowerRate',y='LenderYield')
plt.subplot(1,3,3)
sb.regplot(data=loan_data, x= 'BorrowerRate',y='EstimatedLoss')
<AxesSubplot:xlabel='BorrowerRate', ylabel='EstimatedLoss'>
The gradient of the line plotted for all three variables against the BorrowerRate indicate a positive correlation and linear relationship which means that an increase in one of these variables would also mean an increase in the BorrowerRate.
#Let us take a closer look into the creditscoreRangelower&Upper variables against the borrowerRate.
plt.figure(figsize = [20, 5])
xbins=np.arange(0,loan_data.BorrowerRate.max()+0.05,0.05)
plt.subplot(1, 2, 1)
sb.regplot(data=loan_data, x='CreditScoreRangeUpper',y='BorrowerRate', scatter_kws={'alpha':1/20})
plt.subplot(1, 2, 2)
sb.regplot(data=loan_data, x='CreditScoreRangeLower',y='BorrowerRate',scatter_kws={'alpha':1/20})
<AxesSubplot:xlabel='CreditScoreRangeLower', ylabel='BorrowerRate'>
On the plot above we can see a strong negative correlation between both creditscore ranges and the BorrowerRate. This illustrates the effect of the credit score to the interest rate of a loan. More indept visualization is required to check on the distribution of the loans. A combination with the ProsperRatingNumeric can bring more insight into this.
sb.violinplot(data = loan_data, x='ProsperScore', y='BorrowerRate')
<AxesSubplot:xlabel='ProsperScore', ylabel='BorrowerRate'>
g =sb.FacetGrid(data = loan_data, col ='ProsperScore',col_wrap = 4)
g.map(plt.hist, 'BorrowerRate');
On the heat-map we see a strong negative correlation between the prosperscore and the borrowerrate. I have used a FacetGrid to further investigate this and show the correlation more in depth. The grid shows the BorrowerRate count depending on the ProsperScore. We can observe the trend with the lower scores, where the histograms are left-skewed and then becoming more right-skewed as the score improves. This shows that the lower the risk score the more the BorrowerRate and vice-versa
sb.regplot(data=loan_data, y='DebtToIncomeRatio',x='BorrowerRate')
<AxesSubplot:xlabel='BorrowerRate', ylabel='DebtToIncomeRatio'>
sb.regplot(data=dfLim, y='DebtToIncomeRatio',x='BorrowerRate')
<AxesSubplot:xlabel='BorrowerRate', ylabel='DebtToIncomeRatio'>
The relation shows a weak Positive correlation, which a higher DebtToIncomeRatio might result to a slightly higher BorrowerRate . We also see a slight abnomality which shows some of the loans with low DebtToIncomeRatio having a high BorrowerRate Let us further investigate these loans. We can check their relation to LoanOriginalAmount
dflim2 = loan_data.BorrowerRate.describe([.25,.75,.9,.95,.99])
percentile = dflim2.loc['99%']
percentile
0.35
dflim2 = loan_data[loan_data.BorrowerRate>=percentile]
dflim2
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 241 | 2EBC3426392173872AC3827 | 362669 | 2008-07-06 11:49:51.040000000 | E | 36 | Completed | 2011-07-08 00:00:00 | 0.37453 | 0.35 | 0.34 | ... | -19.81 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | 0 | 0 | 0.0 | 10 |
| 369 | 00F33430399786922887980 | 389897 | 2008-08-28 14:13:26.207000000 | E | 36 | Chargedoff | 2010-04-09 00:00:00 | 0.37453 | 0.35 | 0.34 | ... | -10.29 | 0.00 | 726.73 | 726.73 | 0.00 | 1.0 | 0 | 0 | 0.0 | 16 |
| 372 | 314E34271460439110FA9B2 | 372402 | 2008-07-23 21:09:47.293000000 | D | 36 | Chargedoff | 2010-07-31 00:00:00 | 0.37453 | 0.35 | 0.34 | ... | -50.66 | 0.00 | 2367.02 | 2367.01 | 0.00 | 1.0 | 0 | 0 | 0.0 | 39 |
| 480 | 01083415991046434E155DD | 297133 | 2008-03-22 02:31:54.603000000 | C | 36 | Completed | 2011-04-02 00:00:00 | 0.37453 | 0.35 | 0.34 | ... | -107.92 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | 0 | 0 | 0.0 | 54 |
| 543 | 456D347299603286333614B | 442447 | 2010-01-13 14:11:10.530000000 | NaN | 36 | Chargedoff | 2011-11-25 00:00:00 | 0.37453 | 0.35 | 0.34 | ... | -36.01 | 0.00 | 1953.86 | 1953.85 | 0.00 | 1.0 | 0 | 0 | 0.0 | 97 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113610 | CAA13491131862454BA1371 | 471816 | 2010-08-17 09:23:13.910000000 | NaN | 36 | Completed | 2013-08-25 00:00:00 | 0.38723 | 0.35 | 0.34 | ... | -67.33 | -61.52 | 0.00 | 0.00 | 0.00 | 1.0 | 0 | 0 | 0.0 | 67 |
| 113614 | CABD3413208203621FB889E | 283823 | 2008-02-24 20:38:30.613000000 | HR | 36 | Defaulted | 2011-02-04 00:00:00 | 0.37453 | 0.35 | 0.34 | ... | -25.99 | -58.14 | 372.15 | 0.00 | 116.91 | 1.0 | 0 | 0 | 0.0 | 5 |
| 113617 | C2493464219717831D1AE7D | 426289 | 2009-09-29 12:46:07.890000000 | NaN | 36 | Completed | 2010-02-02 00:00:00 | 0.37453 | 0.35 | 0.34 | ... | -17.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | 0 | 0 | 0.0 | 1 |
| 113674 | CCE03457813831311A59D01 | 418097 | 2009-07-25 06:48:19.003000000 | NaN | 36 | Completed | 2010-11-03 00:00:00 | 0.39153 | 0.35 | 0.34 | ... | -10.64 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | 0 | 0 | 0.0 | 21 |
| 113692 | E21834080607214869D918D | 250328 | 2007-12-13 13:54:51.530000000 | E | 36 | Chargedoff | 2009-01-22 00:00:00 | 0.36623 | 0.35 | 0.34 | ... | -13.96 | 0.00 | 1873.62 | 1873.62 | 0.00 | 1.0 | 0 | 0 | 0.0 | 25 |
1955 rows × 81 columns
bins = np.arange(0,dflim2.BorrowerRate.max()+0.02,0.02)
sb.regplot(data=dflim2, y='LoanOriginalAmount',x='BorrowerRate')
<AxesSubplot:xlabel='BorrowerRate', ylabel='LoanOriginalAmount'>
The LoanOriginalAmount is relatively low for this sample of loans. Let us check their relation to the CreditGrade and the ProsperScore
sb.violinplot(data = dflim2, x='ProsperScore', y='BorrowerRate')
<AxesSubplot:xlabel='ProsperScore', ylabel='BorrowerRate'>
sb.violinplot(data = dflim2, x='CreditGrade', y='BorrowerRate')
<AxesSubplot:xlabel='CreditGrade', ylabel='BorrowerRate'>
On further investigation we can see the score and credit grade assigned to most of these loans are low. Which might mean due to the higher risk for the company, the borrower rate assigned was higher.
plt.figure(figsize=[10,8])
sb.scatterplot(data=loan_data, x='BorrowerRate', y='LoanOriginalAmount')
plt.xlim([0, 0.35])
plt.xlabel('BorrowerRate')
#plt.yscale('log')
plt.ylabel('LoanOriginalAmount')
plt.show()
plt.figure(figsize=[10,8])
sb.regplot(data=loan_data, x='BorrowerRate', y='LoanOriginalAmount',scatter_kws={'alpha':1/20})
plt.xlim([0, 0.35])
plt.xlabel('BorrowerRate')
#plt.yscale('log')
plt.ylabel('LoanOriginalAmount')
plt.show()
There appears to be a weak negative correlation between the LoanOriginalAmount and the BorrowerRate.
plt.figure(figsize=[10,8])
sb.violinplot(data=loan_data, x='ProsperScore', y='LoanOriginalAmount')
plt.xlabel('ProsperScore')
#plt.yscale('log')
plt.ylabel('LoanOriginalAmount')
plt.show()
sb.regplot(data=loan_data,x='MonthlyLoanPayment',y='BorrowerRate',scatter_kws={'alpha':1/20});
We can observe a negative correlation between the monthlyloanpayment and BorrowerRate.
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.
categorical_vars_new = ['CreditGrade', 'ProsperRatingNumeric','ProsperRatingAlpha','IncomeRange','LoanStatus', 'EmploymentStatus']
numeric_vars_new = ['LenderYield', 'BorrowerRate', 'BorrowerAPR', 'LoanOriginalAmount', 'EstimatedLoss']
samples = np.random.choice(loan_data.shape[0], 200, replace = False)
df_samp = loan_data.loc[samples,:]
def boxgrid_1(x, y,color):
""" Quick hack for creating box plots with seaborn's PairGrid. """
default_color = sb.color_palette()[0]
sb.boxplot(x, y, color = default_color);
plt.figure(figsize = [10, 10])
g = sb.PairGrid( y_vars = numeric_vars_new, x_vars = categorical_vars_new,data = df_samp,
height = 5, aspect = 1.5)
g.map(boxgrid_1)
plt.xticks(rotation=90);
plt.show();
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
<Figure size 720x720 with 0 Axes>
From the boxplot above it becomes even more clear that both prosperRatingNumeric and prosperRatingAlpha have similiar correlation with borrowerRate, as a result i will use just one in the next plot. it also seem to me that loanstatus is not a good measure for estimating borrowerRate, so i will also drop that. EmploymentStatus doesn't seem to have a strong correlation with borrowerrate, so i'll drop that as well.
Additionally, the BorrowerAPR and BorrowerRate relate to the categories similarly as they are linearly related.I will therefore drop the BorrowerAPR.
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.
categoric_vars_new = ['CreditGrade', 'ProsperRatingAlpha','IncomeRange','ListingCategory']
numeric_vars_new = ['LenderYield', 'BorrowerRate', 'LoanOriginalAmount', 'EstimatedLoss']
samples = np.random.choice(loan_data.shape[0], 300, replace = False)
df_samp = loan_data.loc[samples,:]
def boxgrid_1(x, y, **kwargs):
""" Quick hack for creating box plots with seaborn's PairGrid. """
default_color = sb.color_palette()[0]
sb.boxplot(x, y, color = default_color)
plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = df_samp, y_vars = numeric_vars_new, x_vars = categoric_vars_new,
height = 3, aspect = 1.5)
g.map(boxgrid_1)
plt.xticks(rotation=90);
plt.show();
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning /home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
<Figure size 720x720 with 0 Axes>
The plot above shows us a clearer picture into the relationship between the categorical variables with the numerical variables. We can observe the relationship between the Income Range and the Loan Original Amount where borrowers with higher income ranges take or borrow bigger loan amounts as compared to those with a low income range. The box plot boundaries indicating more loan amounts increase gradually as the income range. This shows a positive correlation between the two variables. We can see however a negative correlation to the borrower rate and lender yield which indicates borrowers with higher income ranges get lower borrower rates and in turn lower lender yields.
We can also observe negative correlation between the borrower rate and lender yield variables with Credit Grade and Prosper Rating Alpha. And positive correlation between the Credit Grade and ProsperRating Alpha variables with the loan amount indicating higher credit rating can allow borrowers to acquire higher loan amounts.
The relationship between the numerical variables and the Listing category however needs more indepth analysis to show its effect on them.
sb.boxplot(data=loan_data,x='CreditGrade', y='EstimatedLoss')
<AxesSubplot:xlabel='CreditGrade', ylabel='EstimatedLoss'>
loan_data[['CreditGrade','EstimatedLoss']].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CreditGrade 28953 non-null category 1 EstimatedLoss 84853 non-null float64 dtypes: category(1), float64(1) memory usage: 1001.9 KB
mydf = loan_data[loan_data['CreditGrade'].isna()]
mydf[['CreditGrade','EstimatedLoss']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 84984 entries, 1 to 113936 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CreditGrade 0 non-null category 1 EstimatedLoss 84853 non-null float64 dtypes: category(1), float64(1) memory usage: 1.4 MB
adf = loan_data[loan_data['EstimatedLoss'].isna()]
adf[['CreditGrade','EstimatedLoss']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 29084 entries, 0 to 113927 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CreditGrade 28953 non-null category 1 EstimatedLoss 0 non-null float64 dtypes: category(1), float64(1) memory usage: 483.2 KB
It appears that for each EstimatedLoss Entry the CreditGrade column entry is null. This shows an issue with the data source. Therefore we will use the ProsperRatingNumeric as a replacement for CreditGrade as they both have the same grading system according to the data dictionary. Lets take a look at ProsperRatingNumeric.
plt.figure(figsize=[20,10])
plt.subplot(1,2,1)
sb.boxplot(data=loan_data,x='ProsperRatingNumeric', y='EstimatedLoss')
plt.subplot(1,2,2)
sb.boxplot(data=loan_data,x='ProsperScore', y='EstimatedLoss')
<AxesSubplot:xlabel='ProsperScore', ylabel='EstimatedLoss'>
mydf = loan_data[loan_data['EstimatedLoss'].notnull()]
mydf[['ProsperRatingNumeric','EstimatedLoss']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 84853 entries, 1 to 113936 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ProsperRatingNumeric 84853 non-null category 1 EstimatedLoss 84853 non-null float64 dtypes: category(1), float64(1) memory usage: 1.4 MB
isdf = loan_data[loan_data['ProsperRatingAlpha'].notnull()]
isdf[['ProsperRatingAlpha','EstimatedLoss']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 84853 entries, 1 to 113936 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ProsperRatingAlpha 84853 non-null category 1 EstimatedLoss 84853 non-null float64 dtypes: category(1), float64(1) memory usage: 1.4 MB
This served our goal better. We can observe that the higher the rating assigned to borrowers, the lower EstimatedLoss for the loans. Also we can see a negative correlation between the variables prosper score and estimated loss which indicates the better the risk score for a loan the less the estimated loss.
bins = np.arange(.1, 1+.1, .1)
g =sb.FacetGrid(data = loan_data, col ='EmploymentStatus',col_wrap = 4)
g.map(plt.hist, 'BorrowerRate', bins =bins)
<seaborn.axisgrid.FacetGrid at 0x7f7cbfc6c828>
In the plot above we can observe the difference in interest rates per employment status. Right skewed histograms can be observed for employment statuses self-employed, employed, full-time and other. Which indicates more favorable borrower rates for these employment statuses. An indepth analysis is however required. I will combine the employment statuses with their income ranges to get a better picture into the relationship between the two variables.
occups = loan_data.Occupation.unique()
print(len(occups))
68
loan_data.Occupation.value_counts()
Other 28617
Professional 13628
Computer Programmer 4478
Executive 4311
Teacher 3759
...
Dentist 68
Student - College Freshman 41
Student - Community College 28
Judge 22
Student - Technical School 16
Name: Occupation, Length: 67, dtype: int64
In the data we have 68 Occupations and a ProsperScore from 1-11.
Let us explore the risk factor(ProsperScore) calculated for each occupation. Check which occupation has a higher risk score. Then we can check each risk score and the count of the borrowerrate assigned to loans for each score.
This relation can help us determine which occupation has a likeliness to get a higher borrowerrate.
g =sb.FacetGrid(data = loan_data, col ='Occupation',col_wrap = 4)
g.map(sb.countplot, 'ProsperScore')
g.set_axis_labels(x_var='ProsperScore', y_var='Count');
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/axisgrid.py:670: UserWarning: Using the countplot function without specifying `order` is likely to produce an incorrect plot. warnings.warn(warning)
Occups = loan_data.Occupation.value_counts()
Occdesc = Occups.describe([.25,.50,.75,.9,.95,.99])
Perc = Occdesc.loc['50%']
Occupation_ = Occups[Occups>Perc]
myoccupations=Occupation_.index
myoccupations
Index(['Other', 'Professional', 'Computer Programmer', 'Executive', 'Teacher',
'Administrative Assistant', 'Analyst', 'Sales - Commission',
'Accountant/CPA', 'Clerical', 'Sales - Retail', 'Skilled Labor',
'Retail Management', 'Nurse (RN)', 'Construction', 'Truck Driver',
'Laborer', 'Police Officer/Correction Officer', 'Civil Service',
'Engineer - Mechanical', 'Military Enlisted', 'Food Service Management',
'Engineer - Electrical', 'Food Service', 'Medical Technician',
'Attorney', 'Tradesman - Mechanic', 'Social Worker', 'Postal Service',
'Professor', 'Realtor', 'Doctor', 'Nurse (LPN)'],
dtype='object')
#loan_data.loc[loan_data['Occupation'] == [(x) for x in myoccupations]]
df= loan_data[loan_data['Occupation'].isin(myoccupations)]
g =sb.FacetGrid(data = df, col ='Occupation',col_wrap = 4)
g.map(plt.hist, 'BorrowerRate')
g.set_axis_labels(x_var='BorrowerRate', y_var='Count');
plt.figure(figsize=[10,10])
sb.boxplot(data=df,x='Occupation',y='BorrowerRate')
plt.xticks(rotation=90);
On the plots above we can observe relatively equal borrower rates are assigned to different occupations. This however is not a clear depiction of the relationship, and due to the weakness in correlation between the two variables, I will not be pursuing this line of investigation.
g =sb.FacetGrid(data = loan_data, col ='CreditGrade',col_wrap = 4)
g.map(plt.hist, 'BorrowerRate')
g.set_axis_labels(x_var='BorrowerRate');
g =sb.FacetGrid(data = loan_data, col ='ProsperRatingNumeric',col_wrap = 4)
g.map(plt.hist, 'BorrowerRate')
<seaborn.axisgrid.FacetGrid at 0x7f7cbd378d30>
Due to the similarity in meaning between Credit Grade and Prosper Rating Numeric I have depicted both of them to show the relationship between the borrower rate and these credit rating variables. The change of the histograms on the plots above show lower ratings having borrower rates with a higher percentage, than higher ratings which have lower percentage borrower rates. This indicates the effect of credit rating on the borrower rates, the lower the rating the higher the interest rate and vice versa.
sb.regplot(data=loan_data, x='Investors',y='BorrowerRate',scatter_kws={'alpha':1/20})
<AxesSubplot:xlabel='Investors', ylabel='BorrowerRate'>
As shown earlier in univariate exploration where investors histogram was right skewed histogram showing more borrowers had lower amounts of investors. I sought to show the effect of those investors on the borrower rate of the loans. The graph above shows a weak negative correlation between the variables. Indicating more investors in a loan could decrease slightly the borrower rate assigned to those loans.
loan_data.IncomeRange.unique()
['$25,000-49,999', '$50,000-74,999', 'Not displayed', '$100,000+', '$75,000-99,999', '$1-24,999', 'Not employed', '$0'] Categories (8, object): ['Not displayed' < 'Not employed' < '$0' < '$1-24,999' < '$25,000-49,999' < '$50,000-74,999' < '$75,000-99,999' < '$100,000+']
plt.figure(figsize=[10,10])
sb.boxplot(data=loan_data, x='IncomeRange',y='BorrowerRate')
plt.xticks(rotation=90);
bins = np.arange(.1, 1+.1, .1)
g =sb.FacetGrid(data = loan_data, col ='IncomeRange',col_wrap = 4)
g.map(plt.hist, 'BorrowerRate', bins =bins)
<seaborn.axisgrid.FacetGrid at 0x7f7cbdab2048>
#, hue = 'IncomeRange', palette = 'Blues', linestyles = '', dodge = 0.4
fig = plt.figure(figsize = [12,10])
ax = sb.pointplot(data = loan_data, x = 'IncomeRange', y = 'BorrowerRate')
plt.title('IncomeRange Relation to BorrowerRate')
plt.ylabel('BorrowerRate')
plt.xticks(rotation=90);
plt.show();
The plots above are meant to extrpolate further the relationship between the IncomeRange and our focus variable borrower rate. We can see the histograms for borrower rates get more and more right skewed as the income range increases. The point plot shows a slope to the right from income range 1-24999 as the income range increases to 100000$. Showing clearly the effect of more income on the borrower rate of a loan, where borrowers with higher income ranges are more probable to get lower borrower rates.
Upon my investigations, the following observations were made. The following showed a strong positive correlation with BorrowerRate:
LenderYieldBorrowerAPREstimatedLossWhich means an increase in any of these variables or all would mean a consecutive increase in the BorrowerRate.
The following variables show a moderate correlation with BorrowerRate
CreditScoreRangeUpperCreditScoreRangeLowerThese variables show a weak negative correlation with BorrowerRate.
LoanOriginalAmountInvestorsAccording to the data dictionary the following assessment variables exist; I will list them with their relation to BorrowerRate;
CreditGrade : This is a credit assessment variable. Where the higher the Grade the higher the possibility of getting a lower BorrowerRate and vice versa. This can be observed by the nature of the histograms generated for each grade, they are left skewed for lower grades and right skewed for higher grades.ProsperRatingNumeric : This is also a credit assessment variable. Where the higher the rating the higher the possibility of getting a lower BorrowerRate and vice versa.ProsperScore : This is a risk assessment variable for each customer and loans. The lower the number assigned the more the risk, which as result higher BorrowerRates can be observed according to the histograms and violin plots drawn. The rates reduced as the number increases symbolizing less risk.DebtToIncomeRatio feature show a weak positve correlation to the BorrowerRate. Where an increase in the ratio will result in a slight increase in the BorrowerRate.BorrowerRate a negative correlation can be observed for MonthlyLoanPayment. Where a higher MonthlyLoanPayment can result into a slightly lower BorrowerRate.Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
#x_jitter=0.04, scatter_kws={'alpha':0.1}
g=sb.FacetGrid(data=loan_data,col='ProsperScore', height=3.5, col_wrap=4)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerRate', scatter_kws={'alpha':1/5});
g.set_titles('{col_name}')
g.add_legend();
g.set_xlabels(' Loan Amount ($)')
g.set_ylabels('Borrower Rate')
plt.suptitle('Prosper Score Effect on Relationship between rate and Loan Amount');
plt.subplots_adjust(top=0.85)
The line gradient of the graphs show a change in the correlation between the BorrowerRate and LoanOriginalAmount while the ProsperScore rises. A slight positive correlation can be observed on score 11 showing a higher loan amount can result in a slight increase in the BorrowerRate. While a lower score can result in a higher BorrowerRate but would decrease as the Loan amount increases.
g=sb.FacetGrid(data=loan_data,col='ProsperScore', height=3.5, col_wrap=4)
g.map(sb.regplot, 'DebtToIncomeRatio', 'BorrowerRate', scatter_kws={'alpha':1/5});
g.set_titles('{col_name}')
g.add_legend();
g.set_xlabels('DebtToIncomeRatio')
g.set_ylabels('Borrower Rate')
plt.suptitle('Prosper Score Effect on Relationship between rate and the DebtToIncomeRatio');
plt.subplots_adjust(top=0.85)
g=sb.FacetGrid(data=dfLim,col='ProsperScore', height=3.5, col_wrap=4)
g.map(sb.regplot, 'DebtToIncomeRatio', 'BorrowerRate', scatter_kws={'alpha':1/5});
g.set_titles('{col_name}')
g.add_legend();
g.set_xlabels('DebtToIncomeRatio')
g.set_ylabels('Borrower Rate')
plt.suptitle('Prosper Score Effect on Relationship between rate and the DebtToIncomeRatio');
plt.subplots_adjust(top=0.85)
In an attempt to show the effect of the prosper score to the weak positive correlation between the Borrower Rate and the DebttoIncomeRatio, we can see the risk score weaken further the relation. Which shows that a better risk score that indicates less risk for the loan can reduce the effect of the DebtToIncomeRatio therefore resulting into lower borrower rates for higher debttoincomeratio.
We can see that with a lower prosperscore, the BorrowerRate assigned to loans starts at a higher limit rising from there as the ratio increases. With a higher ProsperScore the BorrowerRate starts at a lower limit rising as the ratio increases.
g=sb.FacetGrid(data=loan_data,col='ProsperRatingNumeric', height=3.5, col_wrap=4)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerRate', scatter_kws={'alpha':1/20});
g.set_titles('{col_name}')
g.add_legend();
g.set_xlabels(' Loan Amount ($)')
g.set_ylabels('Borrower Rate')
plt.suptitle('Prosper Rating Numeric Effect on Relationship between rate and Loan Amount');
plt.subplots_adjust(top=0.85)
On the graph above we can see the concentration of the loans lower to the lower borrower rates limits, which indicates the effect of the higher credit rating for the loans. However, we can also see the weak negative correlation between the borrowerrate and loan amount weakened further as the borrower rate limit is reached. We can see that despite the higher credit rating of the borrowers the loans were not assigned lower borrower rates especially on the highest rating score indicaing a limit to the interest rates
loan_data.IncomeRange.value_counts()
$25,000-49,999 32192 $50,000-74,999 31050 $100,000+ 17337 $75,000-99,999 16916 Not displayed 7741 $1-24,999 7274 Not employed 806 $0 621 Name: IncomeRange, dtype: int64
fig = plt.figure(figsize = [12,10])
ax = sb.pointplot(data = loan_data, x = 'ProsperScore', y = 'BorrowerRate', hue = 'IncomeRange',
palette = 'Blues', linestyles = '', dodge = 0.4)
plt.title('IncomeRange Relation to BorrowerRate and ProsperScore')
plt.ylabel('BorrowerRate')
plt.show();
df_flag = (loan_data['BorrowerRate'] >= 0.1) & (loan_data['BorrowerRate'] <= .4)
df_1c = loan_data.loc[df_flag,:]
fig = plt.figure(figsize = [12,10])
ax = sb.pointplot(data = df_1c, x = 'ProsperRatingNumeric', y = 'BorrowerRate', hue = 'IncomeRange',
palette = 'Blues', linestyles = '', dodge = 0.4)
plt.title('IncomeRange relation to BorrowerRate and ProsperScore')
plt.ylabel('BorrowerRate')
plt.show();
On this plot we can see the strengthening effect of the prosperscore and the income range on the borrower rate. We can observe deeper colored points on the lower end of each score indicating that despite the rating score the higher the income range the less the borrower rate. This effect combined with the decrease in the borrower rates as the risk score improves shows a strengthening factor by the two variables on the borrower rates.
Same can be observed for the prosper rating scale graph.
fig = plt.figure(figsize = [10,8])
ax = sb.pointplot(data = loan_data, x = 'ProsperScore', y = 'BorrowerRate', hue = 'ProsperRatingNumeric',
palette = 'Blues', linestyles = '', dodge = 0.4)
plt.title('IncomeRange Relation to BorrowerRate and ProsperScore')
plt.ylabel('BorrowerRate')
plt.show();
Due to the effect of the rating and risk score on other variables. I decided to combine both to show their effect on each other. We can observe that due to the strong strengthening factor by the two variables, the distribution of the borrower rates appears more predictable. Where low borrower rates were assigned to loans with a high prosper rating score and as the risk score improves also the high borrower rates depleted to show even lower borrower rate for the better risk score. This shows that a loan with a high credit rating and the best risk score will only get lower borrower rates.
df_flag = (loan_data['BorrowerRate'] >= 0.1) & (loan_data['BorrowerRate'] <= .4)
df_1c = loan_data.loc[df_flag,:]
#, hue = 'IncomeRange', palette = 'Blues', linestyles = '', dodge = 0.4
fig = plt.figure(figsize = [12,10])
g=sb.FacetGrid(data=loan_data,col='EmploymentStatus', height=3.5, col_wrap=4)
g.map(sb.pointplot, 'IncomeRange', 'BorrowerRate')
g.set_titles('{col_name}')
g.add_legend();
g.set_xlabels('IncomeRange')
g.set_ylabels('Borrower Rate')
g.set_xticklabels(rotation=90)
plt.suptitle('Relationship between BorrowerRate and the IncomeRange per the EmploymentStatus');
plt.subplots_adjust(top=0.85)
plt.show();
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/axisgrid.py:670: UserWarning: Using the pointplot function without specifying `order` is likely to produce an incorrect plot. warnings.warn(warning)
<Figure size 864x720 with 0 Axes>
Previously we discussed the effect of the income range to the borrower rates. Where the higher the income range the lower the borrower rates.We also decided to investigate further its effect on the employment status relation with the borrower rate. We can therefore observe on the plot above, the better point plot slopes of the employment statuses employed, full-time and part time as the incomeranges increase. Indicating that employed borrowers received lower borrower rates and even lower as the income ranges increased.
# create faceted heat maps on levels of the 'CreditGrade' variable
bin_x =np.arange(0, .50+.3, .03)
bin_y =np.arange(0, .50+.3, .03)
g = sb.FacetGrid(data = loan_data, col = 'ProsperScore', col_wrap = 4, size = 3)
g.map(sb.regplot, 'EstimatedLoss', 'BorrowerRate' , scatter_kws={'alpha':1/5})
g.set_xlabels('EstimatedLoss')
g.set_ylabels('BorrowerRate')
plt.show()
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/axisgrid.py:337: UserWarning: The `size` parameter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning)
On the plot above we can see the effect of the prosperscore on the relationship between estimated loss and borrowerrate. We see that the relationship between the two variables does not weaken, however, we can also see the change in the concentration of loans as the prosper risk score improves. We can see a lower concentration of loans on the higher borrower rates and estimated loss scale as the risk score improves. Indicating lower estimated loss and lower borrower rates for loans with better risk scores.
To get a clear picture we will need to use color palettes and differentiate the loans.
g = sb.FacetGrid(data = loan_data, hue = 'ProsperScore', size = 5)
g.map(plt.scatter, 'EstimatedLoss', 'BorrowerRate')
g.set_xlabels('EstimatedLoss')
g.set_ylabels('BorrowerRate')
g.add_legend()
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/axisgrid.py:337: UserWarning: The `size` parameter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning)
<seaborn.axisgrid.FacetGrid at 0x7f7cbfc6c470>
The plot above better shows the effect demonstrated before. The concentration of loans with a lower borrower rate and estimated loss are those with the best risk score. The color palette also shows a change in the score for higher borrower rates.
g = sb.FacetGrid(data = loan_data, hue = 'ProsperRatingNumeric', size = 5)
g.map(plt.scatter, 'EstimatedLoss', 'BorrowerRate')
g.set_xlabels('EstimatedLoss')
g.set_ylabels('BorrowerRate')
g.add_legend()
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/axisgrid.py:337: UserWarning: The `size` parameter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning)
<seaborn.axisgrid.FacetGrid at 0x7f7ccc95fc50>
The same effect demonstrated by the risk score can be seen on the graph above. Showing the effect of the prosper rating to the borrower rates and estimated loss.
# create faceted heat maps on levels of the 'CreditGrade' variable
#bins =[bin_x,bin_y],
plt.figure(figsize=[20,10])
bin_x =np.arange(0, .50+.3, .03)
bin_y =np.arange(0, .50+.3, .03)
g = sb.FacetGrid(data = loan_data, col = 'ProsperRatingNumeric', col_wrap = 4, size = 3)
g.map(plt.hist2d, 'BorrowerAPR', 'BorrowerRate', bins =[bin_x,bin_y], cmin=0.5, cmap='viridis_r')
plt.colorbar()
g.set_xlabels('BorrowerAPR')
g.set_ylabels('BorrowerRate')
plt.show();
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/axisgrid.py:337: UserWarning: The `size` parameter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning)
<Figure size 1440x720 with 0 Axes>
Earlier we saw the strong positive correlation between the BorrowerAPR and the Borrower Rate, here we seek to investigate the effect of the prosperrating numeric on that relationship. As observed we can see the heat map indicate a concentration change on the loans as the rating score changes. The concentration of the loans lower to the lower borrower rates and borrower APR as the rating improves.
# create faceted heat maps on levels of the clarity variable
bin_x =np.arange(0, .50+.3, .03)
bin_y =np.arange(0, .50+.3, .03)
g = sb.FacetGrid(data = loan_data, col = 'IncomeRange', col_wrap = 4, height = 3)
g.map(plt.hist2d, 'BorrowerAPR', 'BorrowerRate', bins =[bin_x,bin_y], cmin=0.5, cmap='viridis_r')
plt.colorbar()
g.set_xlabels('BorrowerAPR')
g.set_ylabels('BorrowerRate($)')
plt.show()
We can also observe on the graph above the same effect demonstrated by prosperratingnumeric, as the income range improves the loans concentrate more on the lower Borrowerrates and BorrowerAPR scales.
# create faceted heat maps on levels of CreditGrade and EmploymentStatus
bin_x =np.arange(0, .50+.3, .03)
bin_y =np.arange(0, .50+.3, .03)
g = sb.FacetGrid(data = loan_data, col = 'ProsperRatingNumeric', row = 'IncomeRange', height = 2.5,
margin_titles = True)
g.map(plt.hist2d, 'BorrowerAPR', 'BorrowerRate', bins =[bin_x,bin_y], cmin=0.5, cmap='viridis_r')
plt.colorbar()
g.set_xlabels('BorrowerAPR')
g.set_ylabels('BorrowerRate')
plt.show()
On the plot above, we can observe the strengthening factor of the income range and the prosper rating numeric on the Interest rates. We can see the concentration of the loans with the lowest credit rating and income range is on the highest side of the borrower rates and borrower APR scales. The concentration also changes to the lowest side of the BorrowerRates and BorrowerAPR scales as the credit rating and the income range improve. This shows that higher credit rates and income ranges will result into lower Interest rates.
g = sb.FacetGrid(data = loan_data, col = 'EmploymentStatus',col_wrap=4, size = 3)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerRate', scatter_kws={'alpha':1/20})
g.set_xlabels('LoanOriginalAmount')
g.set_ylabels('BorrowerRate')
plt.show()
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/axisgrid.py:337: UserWarning: The `size` parameter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning)
On this plot we show the change of the relationship between the BorrowerRate and Loan Amount depending on the employment status of the borrower. We see a contrast in the gradient of the slopes for employment statuses; employed, other, full-time, retired, part-time and self-employed to the statuses not-available and not-employed . We can see that borrowers with the status not-employed and not available started at higher borrower rates and showed weaker negative correlation as compared to the others.
g = sb.FacetGrid(data = loan_data, col = 'EmploymentStatus',col_wrap=4, size = 3)
g.map(sb.regplot, 'EstimatedLoss', 'BorrowerRate', scatter_kws={'alpha':1/20})
g.set_xlabels('EstimatedLoss')
g.set_ylabels('BorrowerRate')
plt.show()
/home/amicah/anaconda3/envs/data_analysis/lib/python3.6/site-packages/seaborn/axisgrid.py:337: UserWarning: The `size` parameter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning)
loan_data.EmploymentStatus.value_counts()
Employed 67322 Full-time 26355 Self-employed 6134 Not available 5347 Other 3806 Part-time 1088 Not employed 835 Retired 795 Name: EmploymentStatus, dtype: int64
On further exploration, we can observe the effect of a ProsperScore on the relation between the EstimatedLoss and the BorrowerRate. The concentration of loans with a lower BorrowerRate and low EstimatedLoss can be observed increasing for higher ProsperScores.
The variables ProsperRatingNumeric and ProsperScore seem to strengthen each others influence on the BorrowerRate assigned. Where a higher ProsperScore and ProsperRatingNumeric resulted into a lower BorrowerRate.
More so we can observe that for higher IncomeRanges and CreditGrade, more loans had a lower BorrowerRate as compared to those with lower IncomeRanges and CreditGrade. An additional correlation with the BorrowerAPR affirms this observation.
Combining the LoanOriginalAmount and the EmploymentStatus showed a steeper gradient for employed Borrowers as compared to Unemployed. With the correlation being negative. This means a higher BorrowerRate can be expected for unemployed individuals as compared to employed.
Due to the value count of the borrowers employment status being high for employed, full time and self employed as compared to the others combination of such as the EmploymentStatus of an individual combined with the EstimatedLoss and BorrowerRate of the loans showed little result in the graph.
Additionally, with the BorrowerRate and LoanOriginalAmount variables combined with the ProsperScore, the gradient of the lines change gradually from a negative correlation to a weak positive correlation at score 10 and 11. Showing a higher loan amount can result in a slight increase in the BorrowerRate, if the ProsperScore is high enough.
The prosper loans dataset contains over 100k observations with 81 variables spanning across 9 years.Understanding the variables, terminology and general domain knowledge of financial peer-to-peer lending was the first obstacle in approaching this dataset. After resolving this, I assessed and cleaned the data for any issues.
My main investigation was geared towards predicting loan BorrowerRate (Interest Rate) based on its relation to the other vairables.
The main variables of interest I identified were:
CreditGradeLoanStatusBorrowerAPRLenderYieldBorrowerStateProsperRatingNumericProsperScoreLoanOriginalAmountInvestorsEmploymentStatusCreditScoreRangeUpperCreditScoreRangeLowerEstimatedLossFor each of these variables I first performed Univariate Exploration of them to identify their characteristics. For instance, I realized the CreditGrade, ProsperRatingNumeric and ProsperRatingAlpha are based on the same concept. Which according to the dictionary are grading variables. I then proceeded to visualize their nature and check on their value_counts.
Then I matched up the data against the BorrowerRates and each other in Bivariate Exploration. This led to significant discoveries such as the necessity to use ProsperRatingNumeric instead of CreditGrade while comparing it to the Estimated Loss. Since for each entry on the EstimatedLoss field a null entry was observed for CreditGrade.
Main discoveries were made on this stage which include,
However, the one ongoing hurdle was determining which variables to analyze, not drifting too far off any one path of investigation and not pulling in new variables throughout the process. Another persistent issue was overplotting on scatterplots, a number of techniques were used across multiple plots to acquire the results.
Additional Variable I considered are:
Success was found in many areas, the general analysis revealed areas of interests such as negative correlation between ProsperScore and BorrowerRate which showed a trend of decrease of the BorrowerRate dependent on the increase in the ProsperScore. Also, trends were confirmed and unexpected, unknown relationships such as those between loan amount and the number of investors contributing to that loan, were revealed.
I also categorize success as to areas that were discovered which need to be further investigated. I believe additional time in multivariate analysis on variables such as occupation, income range, loan category and BorrowerRate would expose more trends and perhaps allow for more predictions. Also, having only analysed 18 of the original 81 variables leaves a lot of undiscovered relationships/correlations and trends.
Additional data would also enhance this dataset. Having the borrower’s age and sex would allow analysis to possibly discover trends among men and women or young and old. Also, population and state-average-income features, would allow for discovery of the type of environment the borrower lived in. For example, a borrower at $75,000 income living in California could be considered middle class within that State. This is in comparison to a borrower from Texas in the same income range bracket which might be consider in the upper class - Does your class (lower, middle or upper) help determine your BorrowerRate? This would be dependent on the borrower’s state and the state’s average income range. These types of questions and more could be answered with additional data.